Cars4U is a tech start-up that aims to find footholes in the used car sales market. The goal of this project is to come up with a pricing model that can effectively predict the price of used cars and can help the business in devising profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.
In this prokect, we will explore and visualize the dataset, build a linear regression model to predict the prices of used cars, and generate a set of insights and recommendations that will help the business. The data contains the different attributes of used cars sold in different locations. The detailed data dictionary is given below.
# import all the python packages that will be needed. If any package is missing, install it first using this command:
# !pip install <package name>
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt # seaborn is based on matplotlib
import matplotlib.ticker as tkr
sns.set(color_codes=True) # adds a nice background to the graphs
%matplotlib inline
from pandas_profiling import ProfileReport
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split # Sklearn package's randomized data splitting function
import warnings
warnings.filterwarnings("ignore") # During the final run, warnings can be disabled.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import mean_absolute_percentage_error
# Read in the data file. Get the row/column counts. See a few rows to make sure the reading was done correctly
cdata = pd.read_csv('used_cars_data.csv')
print("The dataset has",cdata.shape[0],"tuples and",cdata.shape[1],"attributes.")
print('')
print(cdata.dtypes)
print('')
cdata.head()
The dataset has 7253 tuples and 14 attributes. S.No. int64 Name object Location object Year int64 Kilometers_Driven int64 Fuel_Type object Transmission object Owner_Type object Mileage object Engine object Power object Seats float64 New_Price float64 Price float64 dtype: object
| S.No. | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.6 km/kg | 998 CC | 58.16 bhp | 5.0 | 5.51 | 1.75 |
| 1 | 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 kmpl | 1582 CC | 126.2 bhp | 5.0 | 16.06 | 12.50 |
| 2 | 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.2 kmpl | 1199 CC | 88.7 bhp | 5.0 | 8.61 | 4.50 |
| 3 | 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 kmpl | 1248 CC | 88.76 bhp | 7.0 | 11.27 | 6.00 |
| 4 | 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.2 kmpl | 1968 CC | 140.8 bhp | 5.0 | 53.14 | 17.74 |
# Let's keep a copy of the original dataset so that we will have access to it as we do field transformations on the main dataframe.
cdata_orig = pd.read_csv('used_cars_data.csv')
# We have enough number of rows and hence the training/test samples will be large enough to build a regression model.
# S.No is the id field and can be dropped. Name is a string and for this project, we will not try and extract any value from it. In the future, we may be able to extract
# make, model and trim level and use it. But for this project, we will just drop it.
# Data types above are as expected. But Mileage, Engine and Power need to be preprocessed to extract the numeric part before we can use them in regression.
# Let's see some basic stats (mean, min, max, median etc.) of the columns. For non-numeric columns, not all metrics make sense. These will show up as NaN.
cdata.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | 7253 | NaN | NaN | NaN | 3626 | 2093.91 | 0 | 1813 | 3626 | 5439 | 7252 |
| Name | 7253 | 2041 | Mahindra XUV500 W8 2WD | 55 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Location | 7253 | 11 | Mumbai | 949 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Year | 7253 | NaN | NaN | NaN | 2013.37 | 3.25442 | 1996 | 2011 | 2014 | 2016 | 2019 |
| Kilometers_Driven | 7253 | NaN | NaN | NaN | 58699.1 | 84427.7 | 171 | 34000 | 53416 | 73000 | 6.5e+06 |
| Fuel_Type | 7253 | 5 | Diesel | 3852 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Transmission | 7253 | 2 | Manual | 5204 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Owner_Type | 7253 | 4 | First | 5952 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Mileage | 7251 | 450 | 17.0 kmpl | 207 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Engine | 7207 | 150 | 1197 CC | 732 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Power | 7078 | 385 | 74 bhp | 280 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Seats | 7200 | NaN | NaN | NaN | 5.27972 | 0.81166 | 0 | 5 | 5 | 5 | 10 |
| New_Price | 7253 | NaN | NaN | NaN | 21.3073 | 24.2563 | 3.91 | 7.88 | 11.3 | 21.69 | 375 |
| Price | 6019 | NaN | NaN | NaN | 9.47947 | 11.1879 | 0.44 | 3.5 | 5.64 | 9.95 | 160 |
# Let's check if there are any blanks
cdata.isnull().sum()
S.No. 0 Name 0 Location 0 Year 0 Kilometers_Driven 0 Fuel_Type 0 Transmission 0 Owner_Type 0 Mileage 2 Engine 46 Power 175 Seats 53 New_Price 0 Price 1234 dtype: int64
# There are quite a few blanks. "Price" is the target variable. This is blank for a large # of rows (~17%).
# Wherever the target field is blank, we will drop those rows.
# For other attributes, we will use mean/median/mode to fill the blank cases.
# Let's use the pandas profile package to create a basic report on the dataset
# The profile report is in a cell with its own scrolling. So scroll within the cell to see the entire report.
profile = ProfileReport(cdata);
profile
# Save the report file as an HTML for future viewing
profile.to_file("cdata_profile2.html");
# We have done some basic analysis so far.
# Before starting the univariate analysis, let's perform a pre-processing of the Mileage, Engine and Power and get the number part of the given values.
# Extract the mileage part and store it in a new column called Mileage_num
cdata['Mileage_num']=(cdata['Mileage']).str.split(' ',expand=True)[0]
# Extract the second part of the string representing the mileage unit
cdata['Mileage_unit']=(cdata['Mileage']).str.split(' ',expand=True)[1]
# Convert the Mileage_num to float so that we can do numerical analysis and use it as an input to the regression model
cdata['Mileage_num'] = cdata['Mileage_num'].str.strip().astype(dtype = 'float64')
# Let's look at the units
cdata['Mileage_unit'].value_counts()
kmpl 7177 km/kg 74 Name: Mileage_unit, dtype: int64
Most of the tuples have the mileage in kmpl unit. A few are in km/kg. As explained in the Project FAQs, there is no need to convert from one unit to another. We can use the numbers as they are.
# Let's check how the dataset structure looks like, paying attention to the data type of the newly added Mileage_num column.
cdata.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7253 entries, 0 to 7252 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 S.No. 7253 non-null int64 1 Name 7253 non-null object 2 Location 7253 non-null object 3 Year 7253 non-null int64 4 Kilometers_Driven 7253 non-null int64 5 Fuel_Type 7253 non-null object 6 Transmission 7253 non-null object 7 Owner_Type 7253 non-null object 8 Mileage 7251 non-null object 9 Engine 7207 non-null object 10 Power 7078 non-null object 11 Seats 7200 non-null float64 12 New_Price 7253 non-null float64 13 Price 6019 non-null float64 14 Mileage_num 7251 non-null float64 15 Mileage_unit 7251 non-null object dtypes: float64(4), int64(3), object(9) memory usage: 906.8+ KB
# We notice above the Mileage_num data type is float. This is what we wanted. So our approach is working.
# Let's check that the number of tuples with missing mileage isn't changed.
cdata[['Mileage','Mileage_num']].isnull().sum()
Mileage 2 Mileage_num 2 dtype: int64
# We see above that the number of tuples with missling Mileage is still 2. So we haven't changed things there.
# Let's see how the data looks. As we are adding columns, let's increase the limit on how many columns are shown.
pd.options.display.max_columns = 50
cdata.head()
| S.No. | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price | Mileage_num | Mileage_unit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.6 km/kg | 998 CC | 58.16 bhp | 5.0 | 5.51 | 1.75 | 26.60 | km/kg |
| 1 | 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 kmpl | 1582 CC | 126.2 bhp | 5.0 | 16.06 | 12.50 | 19.67 | kmpl |
| 2 | 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.2 kmpl | 1199 CC | 88.7 bhp | 5.0 | 8.61 | 4.50 | 18.20 | kmpl |
| 3 | 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 kmpl | 1248 CC | 88.76 bhp | 7.0 | 11.27 | 6.00 | 20.77 | kmpl |
| 4 | 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.2 kmpl | 1968 CC | 140.8 bhp | 5.0 | 53.14 | 17.74 | 15.20 | kmpl |
We see that Mileage_num column has the number part of the mileage and it is ready to be used as an input to the regression model.
# Let's apply the same approach on Engine attribute.
cdata['Engine_num']=(cdata['Engine']).str.split(' ',expand=True)[0]
cdata['Engine_unit']=(cdata['Engine']).str.split(' ',expand=True)[1]
As we did earlier, let's convert the Engine_num to float.
cdata['Engine_num'] = cdata['Engine_num'].str.strip().astype(dtype = 'float64')
cdata['Engine_unit'].value_counts()
CC 7207 Name: Engine_unit, dtype: int64
All tuples have the same unit. This unit will play no role in the regression model. This field can now be dropped.
cdata[['Engine','Engine_num']].dtypes
Engine object Engine_num float64 dtype: object
# Let's check if we affected the number of tuples with this field missing
cdata[['Engine','Engine_num']].isnull().sum()
Engine 46 Engine_num 46 dtype: int64
The missing count is still 46.
# Here's how the dataset looks now.
cdata.head()
| S.No. | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price | Mileage_num | Mileage_unit | Engine_num | Engine_unit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.6 km/kg | 998 CC | 58.16 bhp | 5.0 | 5.51 | 1.75 | 26.60 | km/kg | 998.0 | CC |
| 1 | 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 kmpl | 1582 CC | 126.2 bhp | 5.0 | 16.06 | 12.50 | 19.67 | kmpl | 1582.0 | CC |
| 2 | 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.2 kmpl | 1199 CC | 88.7 bhp | 5.0 | 8.61 | 4.50 | 18.20 | kmpl | 1199.0 | CC |
| 3 | 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 kmpl | 1248 CC | 88.76 bhp | 7.0 | 11.27 | 6.00 | 20.77 | kmpl | 1248.0 | CC |
| 4 | 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.2 kmpl | 1968 CC | 140.8 bhp | 5.0 | 53.14 | 17.74 | 15.20 | kmpl | 1968.0 | CC |
Finally, let's preprocess the Power field.
cdata['Power_num']=(cdata['Power']).str.split(' ',expand=True)[0]
cdata['Power_unit']=(cdata['Power']).str.split(' ',expand=True)[1]
cdata['Power_num'] = cdata['Power_num'].str.strip().astype(dtype = 'float64')
# Let's check how many different units we have in the dataset
cdata['Power_unit'].value_counts()
bhp 7078 Name: Power_unit, dtype: int64
# There's only one unit ("bhp"). This field Power_unit can be dropped as it plays no role in regression
cdata[['Power','Power_num']].isnull().sum()
Power 175 Power_num 175 dtype: int64
# Missing count above is unchanged. Let's check the datatypes.
cdata[['Power','Power_num']].dtypes
Power object Power_num float64 dtype: object
We have completed the preprocessing of these three columns.
# Let's drop the attributes that we won't use for regression and take a quick peek at all the fields.
cdata = cdata.drop(['S.No.', 'Name','Mileage','Engine','Power','Mileage_unit','Engine_unit','Power_unit'], axis = 1)
cdata.head()
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Seats | New_Price | Price | Mileage_num | Engine_num | Power_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mumbai | 2010 | 72000 | CNG | Manual | First | 5.0 | 5.51 | 1.75 | 26.60 | 998.0 | 58.16 |
| 1 | Pune | 2015 | 41000 | Diesel | Manual | First | 5.0 | 16.06 | 12.50 | 19.67 | 1582.0 | 126.20 |
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | First | 5.0 | 8.61 | 4.50 | 18.20 | 1199.0 | 88.70 |
| 3 | Chennai | 2012 | 87000 | Diesel | Manual | First | 7.0 | 11.27 | 6.00 | 20.77 | 1248.0 | 88.76 |
| 4 | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 5.0 | 53.14 | 17.74 | 15.20 | 1968.0 | 140.80 |
# Let's look at the column data types
cdata.dtypes
Location object Year int64 Kilometers_Driven int64 Fuel_Type object Transmission object Owner_Type object Seats float64 New_Price float64 Price float64 Mileage_num float64 Engine_num float64 Power_num float64 dtype: object
# Convert the categortical columns to as category
# Location has 11 different values and can affect the Price. Hence treating it as a categorical attribute.
for col in ['Location', 'Fuel_Type', 'Transmission', 'Owner_Type']:
cdata[col] = cdata[col].astype('category')
cdata.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7253 entries, 0 to 7252 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Location 7253 non-null category 1 Year 7253 non-null int64 2 Kilometers_Driven 7253 non-null int64 3 Fuel_Type 7253 non-null category 4 Transmission 7253 non-null category 5 Owner_Type 7253 non-null category 6 Seats 7200 non-null float64 7 New_Price 7253 non-null float64 8 Price 6019 non-null float64 9 Mileage_num 7251 non-null float64 10 Engine_num 7207 non-null float64 11 Power_num 7078 non-null float64 dtypes: category(4), float64(6), int64(2) memory usage: 482.6 KB
All data types look good now. Everything that should be numeric is so.
# Next, let's do a univariate analysis of the attributes, especially those we identified earlier as interesting based on the pandas profile report.
# We will also look at the outliers as part of this. We will deal with blanks after this initial analysis
cdata.head()
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Seats | New_Price | Price | Mileage_num | Engine_num | Power_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mumbai | 2010 | 72000 | CNG | Manual | First | 5.0 | 5.51 | 1.75 | 26.60 | 998.0 | 58.16 |
| 1 | Pune | 2015 | 41000 | Diesel | Manual | First | 5.0 | 16.06 | 12.50 | 19.67 | 1582.0 | 126.20 |
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | First | 5.0 | 8.61 | 4.50 | 18.20 | 1199.0 | 88.70 |
| 3 | Chennai | 2012 | 87000 | Diesel | Manual | First | 7.0 | 11.27 | 6.00 | 20.77 | 1248.0 | 88.76 |
| 4 | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 5.0 | 53.14 | 17.74 | 15.20 | 1968.0 | 140.80 |
# Let's get the percentile distribution etc for all columns again since we have new numeric columns
cdata.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Year | 7253.0 | 2013.365366 | 3.254421 | 1996.00 | 2011.00 | 2014.00 | 2016.00 | 2019.00 |
| Kilometers_Driven | 7253.0 | 58699.063146 | 84427.720583 | 171.00 | 34000.00 | 53416.00 | 73000.00 | 6500000.00 |
| Seats | 7200.0 | 5.279722 | 0.811660 | 0.00 | 5.00 | 5.00 | 5.00 | 10.00 |
| New_Price | 7253.0 | 21.307322 | 24.256314 | 3.91 | 7.88 | 11.30 | 21.69 | 375.00 |
| Price | 6019.0 | 9.479468 | 11.187917 | 0.44 | 3.50 | 5.64 | 9.95 | 160.00 |
| Mileage_num | 7251.0 | 18.141580 | 4.562197 | 0.00 | 15.17 | 18.16 | 21.10 | 33.54 |
| Engine_num | 7207.0 | 1616.573470 | 595.285137 | 72.00 | 1198.00 | 1493.00 | 1968.00 | 5998.00 |
| Power_num | 7078.0 | 112.765214 | 53.493553 | 34.20 | 75.00 | 94.00 | 138.10 | 616.00 |
# We will use the utility functions provided in the reference solution of the previous project.
# While doing uni-variate analysis of numerical variables we want to study their central tendency
# and dispersion.
# Let us write a function that will help us create boxplot and histogram for any input numerical
# variable.
# This function takes the numerical column as the input and returns the boxplots
# and histograms for the variable.
# Let us see if this help us write faster and cleaner code.
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
# This code is also from the reference solution of the previous project.
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
histogram_boxplot(cdata, "Year")
histogram_boxplot(cdata, "Kilometers_Driven")
histogram_boxplot(cdata[cdata['Kilometers_Driven']<500000], "Kilometers_Driven")
# Let's take a look at the outliers. We will use the original of the dataset since we wnat to see the Name also.
cdata_orig[cdata_orig['Kilometers_Driven']>=500000]
| S.No. | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 340 | 340 | Skoda Octavia Ambition Plus 2.0 TDI AT | Kolkata | 2013 | 775000 | Diesel | Automatic | First | 19.3 kmpl | 1968 CC | 141 bhp | 5.0 | 25.83 | 7.5 |
| 358 | 358 | Hyundai i10 Magna 1.2 | Chennai | 2009 | 620000 | Petrol | Manual | First | 20.36 kmpl | 1197 CC | 78.9 bhp | 5.0 | 10.25 | 2.7 |
| 1860 | 1860 | Volkswagen Vento Diesel Highline | Chennai | 2013 | 720000 | Diesel | Manual | First | 20.54 kmpl | 1598 CC | 103.6 bhp | 5.0 | 13.23 | 5.9 |
| 2328 | 2328 | BMW X5 xDrive 30d M Sport | Chennai | 2017 | 6500000 | Diesel | Automatic | First | 15.97 kmpl | 2993 CC | 258 bhp | 5.0 | 55.07 | 65.0 |
# Delete the suspicious rows identified above
print(cdata.shape)
cdata.drop(cdata[cdata['Kilometers_Driven'] >= 500000].index, inplace = True)
print(cdata.shape)
(7253, 12) (7249, 12)
histogram_boxplot(cdata, "Seats")
cdata_orig[cdata_orig['Seats']>=9]
| S.No. | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 814 | 814 | Toyota Qualis FS B2 | Pune | 2004 | 77757 | Diesel | Manual | Second | 13.1 kmpl | 2446 CC | 75 bhp | 10.0 | 24.010 | 3.50 |
| 917 | 917 | Tata Sumo DX | Mumbai | 2013 | 20000 | Diesel | Manual | First | 14.07 kmpl | 1978 CC | 83.1 bhp | 10.0 | 7.630 | 5.25 |
| 1347 | 1347 | Mahindra Scorpio LX BS IV | Bangalore | 2007 | 74112 | Diesel | Manual | First | 12.05 kmpl | 2179 CC | 120 bhp | 9.0 | 18.865 | 4.10 |
| 1907 | 1907 | Toyota Qualis FS B3 | Bangalore | 2002 | 63000 | Diesel | Manual | Third | 13.1 kmpl | 2446 CC | 75 bhp | 10.0 | 24.010 | 3.65 |
| 2267 | 2267 | Toyota Qualis RS E2 | Pune | 2004 | 215750 | Diesel | Manual | Second | 0.0 kmpl | 2446 CC | NaN | 10.0 | 24.010 | 3.50 |
| 2312 | 2312 | Chevrolet Tavera LT 9 Str BS IV | Kolkata | 2012 | 67132 | Diesel | Manual | First | 13.58 kmpl | 2499 CC | 78 bhp | 9.0 | 11.300 | 4.75 |
| 2359 | 2359 | Mahindra Xylo D2 Maxx | Chennai | 2016 | 150000 | Diesel | Manual | First | 14.95 kmpl | 2489 CC | 93.7 bhp | 9.0 | 11.670 | 4.50 |
| 2575 | 2575 | Chevrolet Tavera LS B3 10 Seats BSIII | Hyderabad | 2015 | 120000 | Diesel | Manual | First | 14.8 kmpl | 2499 CC | 80 bhp | 10.0 | 11.300 | 5.50 |
| 6242 | 6242 | Tata Sumo EX 10/7 Str BSII | Chennai | 2015 | 196000 | Diesel | Manual | Second | 12.2 kmpl | 1948 CC | 68 bhp | 10.0 | 7.630 | NaN |
| 6288 | 6288 | Chevrolet Tavera LS B3 10 Seats BSIII | Hyderabad | 2005 | 150000 | Diesel | Manual | Second | 14.8 kmpl | 2499 CC | 80 bhp | 10.0 | 11.300 | NaN |
| 6875 | 6875 | Toyota Qualis FS B3 | Pune | 2002 | 119613 | Diesel | Manual | Second | 13.1 kmpl | 2446 CC | 75 bhp | 10.0 | 24.010 | NaN |
Based on the Name, it looks alright. For these models, 9/10 seats is okay.
cdata_orig[cdata_orig['Seats']<=1]
| S.No. | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3999 | 3999 | Audi A4 3.2 FSI Tiptronic Quattro | Hyderabad | 2012 | 125000 | Petrol | Automatic | First | 10.5 kmpl | 3197 CC | NaN | 0.0 | 53.14 | 18.0 |
This is suspicious. 0 Seater? Also Power is NaN. We will delete this row. Alternately, we could have set this field to blank and then deal with this the way we deal with blanks later. But I think deleting is better as the row may have other issues as well.
# Let's delete the suspicious row
print(cdata.shape)
cdata.drop(cdata[cdata['Seats'] <= 1].index, inplace = True)
print(cdata.shape)
(7249, 12) (7248, 12)
histogram_boxplot(cdata, "New_Price")
# Let's see how the log of New_Price looks
New_Price_log = pd.DataFrame(np.log(cdata['New_Price']))
New_Price_log.columns =['New_Price_log']
histogram_boxplot(New_Price_log, "New_Price_log")
histogram_boxplot(cdata, "Price")
Price_log = pd.DataFrame(np.log(cdata['Price']))
Price_log.columns =['Price_log']
histogram_boxplot(Price_log, "Price_log")
histogram_boxplot(cdata, "Power_num")
histogram_boxplot(cdata, "Mileage_num")
# Let's look at cases where mileage is < 1
cdata[cdata['Mileage_num']<1]
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Seats | New_Price | Price | Mileage_num | Engine_num | Power_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | Pune | 2012 | 85000 | Diesel | Automatic | Second | 5.0 | 120.000 | 17.50 | 0.0 | 2179.0 | 115.0 |
| 67 | Coimbatore | 2019 | 15369 | Diesel | Automatic | First | 5.0 | 49.140 | 35.67 | 0.0 | 1950.0 | 194.0 |
| 79 | Hyderabad | 2005 | 87591 | Petrol | Manual | First | 5.0 | 4.550 | 1.30 | 0.0 | 1086.0 | NaN |
| 194 | Ahmedabad | 2007 | 60006 | Petrol | Manual | First | NaN | 13.580 | 2.95 | 0.0 | NaN | NaN |
| 229 | Bangalore | 2015 | 70436 | Diesel | Manual | First | NaN | 7.650 | 3.60 | 0.0 | 1498.0 | 99.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6633 | Kolkata | 2016 | 27000 | Diesel | Manual | First | NaN | 18.865 | NaN | 0.0 | NaN | NaN |
| 6697 | Jaipur | 2007 | 85000 | Petrol | Manual | Second | 5.0 | 4.550 | NaN | 0.0 | 1086.0 | NaN |
| 6857 | Mumbai | 2011 | 87000 | Diesel | Automatic | First | 5.0 | 120.000 | NaN | 0.0 | 2179.0 | 115.0 |
| 6957 | Kochi | 2019 | 11574 | Petrol | Manual | First | NaN | 9.600 | NaN | 0.0 | 1199.0 | 88.7 |
| 7226 | Ahmedabad | 2014 | 41000 | Petrol | Manual | First | 5.0 | 4.550 | NaN | 0.0 | 1086.0 | 62.0 |
81 rows × 12 columns
# Let's see these rows in the orginal data set so that we can see all columns
cdata_orig[(~ cdata_orig['Mileage'].isnull()) & cdata_orig['Mileage'].str.startswith('0')]
| S.No. | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | 14 | Land Rover Freelander 2 TD4 SE | Pune | 2012 | 85000 | Diesel | Automatic | Second | 0.0 kmpl | 2179 CC | 115 bhp | 5.0 | 120.000 | 17.50 |
| 67 | 67 | Mercedes-Benz C-Class Progressive C 220d | Coimbatore | 2019 | 15369 | Diesel | Automatic | First | 0.0 kmpl | 1950 CC | 194 bhp | 5.0 | 49.140 | 35.67 |
| 79 | 79 | Hyundai Santro Xing XL | Hyderabad | 2005 | 87591 | Petrol | Manual | First | 0.0 kmpl | 1086 CC | NaN | 5.0 | 4.550 | 1.30 |
| 194 | 194 | Honda City 1.5 GXI | Ahmedabad | 2007 | 60006 | Petrol | Manual | First | 0.0 kmpl | NaN | NaN | NaN | 13.580 | 2.95 |
| 229 | 229 | Ford Figo Diesel | Bangalore | 2015 | 70436 | Diesel | Manual | First | 0.0 kmpl | 1498 CC | 99 bhp | NaN | 7.650 | 3.60 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6633 | 6633 | Mahindra TUV 300 P4 | Kolkata | 2016 | 27000 | Diesel | Manual | First | 0.0 kmpl | NaN | NaN | NaN | 18.865 | NaN |
| 6697 | 6697 | Hyundai Santro Xing XL | Jaipur | 2007 | 85000 | Petrol | Manual | Second | 0.0 kmpl | 1086 CC | NaN | 5.0 | 4.550 | NaN |
| 6857 | 6857 | Land Rover Freelander 2 TD4 SE | Mumbai | 2011 | 87000 | Diesel | Automatic | First | 0.0 kmpl | 2179 CC | 115 bhp | 5.0 | 120.000 | NaN |
| 6957 | 6957 | Honda Jazz 2020 Petrol | Kochi | 2019 | 11574 | Petrol | Manual | First | 0.0 kmpl | 1199 CC | 88.7 bhp | NaN | 9.600 | NaN |
| 7226 | 7226 | Hyundai Santro Xing GL | Ahmedabad | 2014 | 41000 | Petrol | Manual | First | 0.0 kmpl | 1086 CC | 62 bhp | 5.0 | 4.550 | NaN |
81 rows × 14 columns
# Our best bet would be to delete these rows.
# Let's delete the suspicious rows. Another option would be set the Mileage to blank and deal with these later when we deal with blanks.
# But I prefer deleting them because other fields of these rows may also ne wrong.
print(cdata.shape)
cdata.drop(cdata[cdata['Mileage_num'] <= 1].index, inplace = True)
print(cdata.shape)
(7248, 12) (7167, 12)
histogram_boxplot(cdata, "Engine_num")
# We will look at bar plots for categorical columns that are also of String type.
labeled_barplot(cdata, "Location", perc=True)
# We will look at bar plots for categorical columns that are also of String type.
labeled_barplot(cdata, "Fuel_Type", perc=True)
# We will look at bar plots for categorical columns that are also of String type.
labeled_barplot(cdata, "Transmission", perc=True)
# We will look at bar plots for categorical columns that are also of String type.
labeled_barplot(cdata, "Owner_Type", perc=True)
cdata.head()
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Seats | New_Price | Price | Mileage_num | Engine_num | Power_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mumbai | 2010 | 72000 | CNG | Manual | First | 5.0 | 5.51 | 1.75 | 26.60 | 998.0 | 58.16 |
| 1 | Pune | 2015 | 41000 | Diesel | Manual | First | 5.0 | 16.06 | 12.50 | 19.67 | 1582.0 | 126.20 |
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | First | 5.0 | 8.61 | 4.50 | 18.20 | 1199.0 | 88.70 |
| 3 | Chennai | 2012 | 87000 | Diesel | Manual | First | 7.0 | 11.27 | 6.00 | 20.77 | 1248.0 | 88.76 |
| 4 | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 5.0 | 53.14 | 17.74 | 15.20 | 1968.0 | 140.80 |
# Let's now handle the missing values
cdata.isnull().sum()
Location 0 Year 0 Kilometers_Driven 0 Fuel_Type 0 Transmission 0 Owner_Type 0 Seats 39 New_Price 0 Price 1221 Mileage_num 2 Engine_num 34 Power_num 129 dtype: int64
# For numerical attributes, if it is skewed, we will use the median, otherwise we will use the mean.
# For categorical attributes, we will use the mode.
# Rows with missing target variable will get deleted.
# First let's rename the columns to get rid of the _num suffix
cdata.rename(columns = {'Mileage_num':'Mileage', 'Engine_num':'Engine','Power_num':'Power'}, inplace = True)
# Now, handle the blanks
cdata['Engine'] = cdata['Engine'].fillna(cdata['Engine'].median())
cdata['Power'] = cdata['Power'].fillna(cdata['Power'].median())
cdata['Seats'] = cdata['Seats'].fillna(cdata['Seats'].mode()[0])
cdata['Mileage'] = cdata['Mileage'].fillna(cdata['Mileage'].mean())
cdata.isnull().sum()
Location 0 Year 0 Kilometers_Driven 0 Fuel_Type 0 Transmission 0 Owner_Type 0 Seats 0 New_Price 0 Price 1221 Mileage 0 Engine 0 Power 0 dtype: int64
# Rows with missing target variable will get deleted.
cdata.dropna(inplace=True)
cdata.isnull().sum()
Location 0 Year 0 Kilometers_Driven 0 Fuel_Type 0 Transmission 0 Owner_Type 0 Seats 0 New_Price 0 Price 0 Mileage 0 Engine 0 Power 0 dtype: int64
# No more missing cells
cdata.head()
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Seats | New_Price | Price | Mileage | Engine | Power | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mumbai | 2010 | 72000 | CNG | Manual | First | 5.0 | 5.51 | 1.75 | 26.60 | 998.0 | 58.16 |
| 1 | Pune | 2015 | 41000 | Diesel | Manual | First | 5.0 | 16.06 | 12.50 | 19.67 | 1582.0 | 126.20 |
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | First | 5.0 | 8.61 | 4.50 | 18.20 | 1199.0 | 88.70 |
| 3 | Chennai | 2012 | 87000 | Diesel | Manual | First | 7.0 | 11.27 | 6.00 | 20.77 | 1248.0 | 88.76 |
| 4 | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 5.0 | 53.14 | 17.74 | 15.20 | 1968.0 | 140.80 |
# Let's transform the owner_type attribute
# These are the current value counts.
cdata['Owner_Type'].value_counts()
First 4881 Second 948 Third 109 Fourth & Above 8 Name: Owner_Type, dtype: int64
# Set the value in a new column to 1-4 based on owner_type
cdata.loc[cdata.Owner_Type == "First", "Owner_Count"] = 1
cdata.loc[cdata.Owner_Type == "Second", "Owner_Count"] = 2
cdata.loc[cdata.Owner_Type == "Third", "Owner_Count"] = 3
cdata.loc[cdata.Owner_Type == "Fourth & Above", "Owner_Count"] = 4
print(cdata['Owner_Count'].value_counts())
1.0 4881 2.0 948 3.0 109 4.0 8 Name: Owner_Count, dtype: int64
# Let's set the new attribute to category type
cdata['Owner_Count'] = cdata['Owner_Count'].astype('category')
# Let's drop the owner_type attribute
cdata = cdata.drop(['Owner_Type'], axis = 1)
cdata.head()
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Seats | New_Price | Price | Mileage | Engine | Power | Owner_Count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mumbai | 2010 | 72000 | CNG | Manual | 5.0 | 5.51 | 1.75 | 26.60 | 998.0 | 58.16 | 1.0 |
| 1 | Pune | 2015 | 41000 | Diesel | Manual | 5.0 | 16.06 | 12.50 | 19.67 | 1582.0 | 126.20 | 1.0 |
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | 5.0 | 8.61 | 4.50 | 18.20 | 1199.0 | 88.70 | 1.0 |
| 3 | Chennai | 2012 | 87000 | Diesel | Manual | 7.0 | 11.27 | 6.00 | 20.77 | 1248.0 | 88.76 | 1.0 |
| 4 | Coimbatore | 2013 | 40670 | Diesel | Automatic | 5.0 | 53.14 | 17.74 | 15.20 | 1968.0 | 140.80 | 2.0 |
# Let's do a describe before moving on to bivariate analysis
cdata.describe(include = 'all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Location | 5946 | 11 | Mumbai | 785 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Year | 5946 | NaN | NaN | NaN | 2013.41 | 3.22101 | 1998 | 2012 | 2014 | 2016 | 2019 |
| Kilometers_Driven | 5946 | NaN | NaN | NaN | 57154.2 | 34826.6 | 171 | 33900.2 | 53000 | 72671 | 480000 |
| Fuel_Type | 5946 | 5 | Diesel | 3175 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Transmission | 5946 | 2 | Manual | 4254 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Seats | 5946 | NaN | NaN | NaN | 5.27985 | 0.803517 | 2 | 5 | 5 | 5 | 10 |
| New_Price | 5946 | NaN | NaN | NaN | 21.3816 | 24.0986 | 3.91 | 7.88 | 11.3 | 21.77 | 375 |
| Price | 5946 | NaN | NaN | NaN | 9.44904 | 11.1184 | 0.44 | 3.5 | 5.65 | 9.915 | 160 |
| Mileage | 5946 | NaN | NaN | NaN | 18.3431 | 4.17498 | 6.4 | 15.3 | 18.2 | 21.1 | 33.54 |
| Engine | 5946 | NaN | NaN | NaN | 1618.94 | 597.214 | 72 | 1198 | 1493 | 1968 | 5998 |
| Power | 5946 | NaN | NaN | NaN | 112.765 | 53.3815 | 34.2 | 76.925 | 94 | 138.03 | 560 |
| Owner_Count | 5946 | 4 | 1 | 4881 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# Let's start with a heatmap showing the correlation between various numerical attributes
plt.figure(figsize=(15, 7))
sns.heatmap(cdata.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
# Let's see the pairwise relationship look like for different Fuel_Type, Transmission type and Location. Let's start with Fuel_Type.
# To avoid cluttering, let's look at these two Fuel types only. Since they represent most of the data.
options = ['Petrol', 'Diesel']
sns.pairplot(data=cdata[cdata["Fuel_Type"].isin(options) != 0],hue="Fuel_Type", diag_kind="kde")
plt.show()
#Note that in the plots below, orange is Diesel and blue is Petrol
# Now let's get a high level analysis based on Transmission
sns.pairplot(data=cdata,hue="Transmission", diag_kind="kde")
plt.show()
#Note that in the plots below, orange is Diesel and blue is Petrol
# Mileage vs Transmission
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Transmission"],cdata["Mileage"],palette="PuBu")
plt.show()
# Fuel_Type vs Transmission
plt.figure(figsize=(15,7))
sns.countplot(x='Fuel_Type',data=cdata, palette='rainbow',hue='Transmission')
plt.legend(loc='upper left')
plt.title("Count of cars of different fuel type, Separated by Transmission");
# Location vs Transmission
plt.figure(figsize=(15,7))
sns.countplot(x='Location',data=cdata, palette='rainbow',hue='Transmission')
plt.legend(loc='upper left')
plt.title("Count of cars of different locations, Separated by Transmission");
# Owner_Count vs Transmission
plt.figure(figsize=(15,7))
sns.countplot(x='Owner_Count',data=cdata, palette='rainbow',hue='Transmission')
plt.legend(loc='upper left')
plt.title("Count of cars of different owner counts, Separated by Transmission");
# Year vs Transmission
plt.figure(figsize=(15,7))
sns.countplot(x='Year',data=cdata, palette='rainbow',hue='Transmission')
plt.legend(loc='upper left')
plt.title("Count of cars over the years, Separated by Transmission");
# Engine vs Transmission
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Transmission"],cdata["Engine"],palette="PuBu")
plt.show()
# Power vs Transmission
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Transmission"],cdata["Power"],palette="PuBu")
plt.show()
# New Price vs Transmission
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Transmission"],cdata["New_Price"],palette="PuBu")
plt.show()
# Price vs Transmission
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Transmission"],cdata["Price"],palette="PuBu")
plt.show()
# Seats vs Transmission
plt.figure(figsize=(15,7))
sns.countplot(x='Seats',data=cdata, palette='rainbow',hue='Transmission')
plt.legend(loc='upper left')
plt.title("Count of cars of different number of seats, Separated by Transmission");
# Kilometers_Driven vs Transmission
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Transmission"],cdata["Kilometers_Driven"],palette="PuBu")
plt.show()
# Fuel_Type vs Mileage
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Fuel_Type"],cdata["Mileage"],palette="PuBu")
plt.show()
# Fuel_Type vs Mileage
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Location"],cdata["Mileage"],palette="PuBu")
plt.show()
# Fuel_Type vs Mileage
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Owner_Count"],cdata["Mileage"],palette="PuBu")
plt.show()
# Year vs Mileage
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Year"],cdata["Mileage"],palette="PuBu")
plt.show()
# Engine vs Mileage
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Seats"],cdata["Mileage"],palette="PuBu")
plt.show()
# Location vs Fuel_Type
plt.figure(figsize=(15,7))
sns.countplot(x='Location',data=cdata, palette='rainbow',hue='Fuel_Type')
plt.legend(loc='upper left')
plt.title("Count of cars at different locations, Separated by Fuel_Type");
# Location vs Fuel_Type
plt.figure(figsize=(15,7))
sns.countplot(x='Owner_Count',data=cdata, palette='rainbow',hue='Fuel_Type')
plt.legend(loc='upper right')
plt.title("Count of cars of different owner counts, Separated by Fuel_Type");
# Year vs Fuel_Type
plt.figure(figsize=(15,7))
sns.countplot(x='Year',data=cdata, palette='rainbow',hue='Fuel_Type')
plt.legend(loc='upper left')
plt.title("Count of cars of different years, Separated by Fuel_Type");
# Fuel_Type vs Engine
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Fuel_Type"],cdata["Engine"],palette="PuBu")
plt.show()
# Fuel_Type vs Power
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Fuel_Type"],cdata["Power"],palette="PuBu")
plt.show()
# New_Price vs Fuel_Type
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Fuel_Type"],cdata["New_Price"],palette="PuBu")
plt.show()
# Fuel_Type vs Price
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Fuel_Type"],cdata["Price"],palette="PuBu")
plt.show()
# Kilometers_Driven vs Price
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Fuel_Type"],cdata["Kilometers_Driven"],palette="PuBu")
plt.show()
# Seats vs Fuel_Type
plt.figure(figsize=(15,7))
sns.countplot(x='Seats',data=cdata, palette='rainbow',hue='Fuel_Type')
plt.legend(loc='upper left')
plt.title("Count of cars of different seat capacity, Separated by Fuel_Type");
# Owner_Count vs Location
plt.figure(figsize=(15,7))
sns.countplot(x='Location',data=cdata, palette='rainbow',hue='Owner_Count')
plt.legend(loc='upper left')
plt.title("Count of cars of different owner count, Separated by Location");
# Year vs Location
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Location"],cdata["Year"],palette="PuBu")
plt.show()
# Year vs Location
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Location"],cdata["Engine"],palette="PuBu")
plt.show()
# Year vs Location
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Location"],cdata["Power"],palette="PuBu")
plt.show()
# Year vs Location
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Location"],cdata["New_Price"],palette="PuBu")
plt.show()
# Year vs Location
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Location"],cdata["Price"],palette="PuBu")
plt.show()
# Seats vs Location
plt.figure(figsize=(15,7))
sns.countplot(x='Location',data=cdata, palette='rainbow',hue='Seats')
plt.legend(loc='upper left')
plt.title("Count of cars of different seat capacity, Separated by Location");
# Year vs Location
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Location"],cdata["Kilometers_Driven"],palette="PuBu")
plt.show()
# Year vs Owner_Count
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Owner_Count"],cdata["Year"],palette="PuBu")
plt.show()
# Engine vs Owner_Count
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Owner_Count"],cdata["Engine"],palette="PuBu")
plt.show()
# Power vs Owner_Count
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Owner_Count"],cdata["Power"],palette="PuBu")
plt.show()
# New_Price vs Owner_Count
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Owner_Count"],cdata["New_Price"],palette="PuBu")
plt.show()
# Price vs Owner_Count
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Owner_Count"],cdata["Price"],palette="PuBu")
plt.show()
# Seats vs Owner_Count
plt.figure(figsize=(15,7))
sns.countplot(x='Owner_Count',data=cdata, palette='rainbow',hue='Seats')
plt.legend(loc='upper left')
plt.title("Count of cars of different seat capacity, Separated by Owner count");
# Kilometers_Driven vs Owner_Count
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Owner_Count"],cdata["Kilometers_Driven"],palette="PuBu")
plt.show()
# Year vs Engine
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Year"],cdata["Engine"],palette="PuBu")
plt.show()
# Year vs Power
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Year"],cdata["Power"],palette="PuBu")
plt.show()
# Year vs New_Price
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Year"],cdata["New_Price"],palette="PuBu")
plt.show()
# Year vs New_Price
plt.figure(figsize=(15,7))
cdata_ignore_vcostly = cdata[cdata["New_Price"]<= 100]
sns.boxplot(cdata_ignore_vcostly["Year"],cdata_ignore_vcostly["New_Price"],palette="PuBu")
plt.show()
# Year vs Price
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Year"],cdata["Price"],palette="PuBu")
plt.show()
# Year vs Seats
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Seats"],cdata["Year"],palette="PuBu")
plt.show()
# Year vs Kilometers_Driven
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Year"],cdata["Kilometers_Driven"],palette="PuBu")
plt.show()
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Fuel_Type"],cdata["Price"],hue=cdata["Transmission"])
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()
# Let's see what Transmission types are available for which fuel types
cdata[["Fuel_Type","Transmission"]].value_counts(sort=False)
Fuel_Type Transmission
CNG Manual 56
Diesel Automatic 1085
Manual 2090
Electric Automatic 2
LPG Manual 10
Petrol Automatic 605
Manual 2098
dtype: int64
plt.figure(figsize=(15,7))
sns.boxplot(cdata["Fuel_Type"],cdata["Kilometers_Driven"],hue=cdata["Transmission"])
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()
plt.figure(figsize=(15,7))
sns.lineplot(cdata["Year"],cdata["New_Price"],hue=cdata["Transmission"],ci=0)
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()
plt.figure(figsize=(15,7))
sns.lineplot(cdata["Year"],cdata["Price"],hue=cdata["Transmission"],ci=0)
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()
plt.figure(figsize=(15,7))
sns.lineplot(cdata["Year"],cdata["New_Price"],hue=cdata["Location"],ci=0)
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()
plt.figure(figsize=(15,7))
sns.lineplot(cdata["Year"],cdata["Price"],hue=cdata["Location"],ci=0)
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()
plt.figure(figsize=(15,7))
sns.lineplot(cdata["Year"],cdata["New_Price"],hue=cdata["Seats"],ci=0)
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()
plt.figure(figsize=(15,7))
sns.lineplot(cdata["Year"],cdata["Price"],hue=cdata["Seats"],ci=0)
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()
# Now we are ready to start the linear regression modelling. Let's look at the data and data types one more time
cdata.head()
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Seats | New_Price | Price | Mileage | Engine | Power | Owner_Count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mumbai | 2010 | 72000 | CNG | Manual | 5.0 | 5.51 | 1.75 | 26.60 | 998.0 | 58.16 | 1.0 |
| 1 | Pune | 2015 | 41000 | Diesel | Manual | 5.0 | 16.06 | 12.50 | 19.67 | 1582.0 | 126.20 | 1.0 |
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | 5.0 | 8.61 | 4.50 | 18.20 | 1199.0 | 88.70 | 1.0 |
| 3 | Chennai | 2012 | 87000 | Diesel | Manual | 7.0 | 11.27 | 6.00 | 20.77 | 1248.0 | 88.76 | 1.0 |
| 4 | Coimbatore | 2013 | 40670 | Diesel | Automatic | 5.0 | 53.14 | 17.74 | 15.20 | 1968.0 | 140.80 | 2.0 |
cdata.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 5946 entries, 0 to 6018 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Location 5946 non-null category 1 Year 5946 non-null int64 2 Kilometers_Driven 5946 non-null int64 3 Fuel_Type 5946 non-null category 4 Transmission 5946 non-null category 5 Seats 5946 non-null float64 6 New_Price 5946 non-null float64 7 Price 5946 non-null float64 8 Mileage 5946 non-null float64 9 Engine 5946 non-null float64 10 Power 5946 non-null float64 11 Owner_Count 5946 non-null category dtypes: category(4), float64(6), int64(2) memory usage: 602.2 KB
# Let's do 1 hot encoding of text attributes
cdata1h = pd.get_dummies(cdata, columns = ['Location', 'Fuel_Type','Transmission'],drop_first=True)
pd.set_option("display.max_columns", None)
cdata1h.head()
| Year | Kilometers_Driven | Seats | New_Price | Price | Mileage | Engine | Power | Owner_Count | Location_Bangalore | Location_Chennai | Location_Coimbatore | Location_Delhi | Location_Hyderabad | Location_Jaipur | Location_Kochi | Location_Kolkata | Location_Mumbai | Location_Pune | Fuel_Type_Diesel | Fuel_Type_Electric | Fuel_Type_LPG | Fuel_Type_Petrol | Transmission_Manual | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | 72000 | 5.0 | 5.51 | 1.75 | 26.60 | 998.0 | 58.16 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2015 | 41000 | 5.0 | 16.06 | 12.50 | 19.67 | 1582.0 | 126.20 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 2 | 2011 | 46000 | 5.0 | 8.61 | 4.50 | 18.20 | 1199.0 | 88.70 | 1.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 2012 | 87000 | 7.0 | 11.27 | 6.00 | 20.77 | 1248.0 | 88.76 | 1.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 2013 | 40670 | 5.0 | 53.14 | 17.74 | 15.20 | 1968.0 | 140.80 | 2.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
print(cdata1h.shape)
cdata1h.info()
(5946, 24) <class 'pandas.core.frame.DataFrame'> Int64Index: 5946 entries, 0 to 6018 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year 5946 non-null int64 1 Kilometers_Driven 5946 non-null int64 2 Seats 5946 non-null float64 3 New_Price 5946 non-null float64 4 Price 5946 non-null float64 5 Mileage 5946 non-null float64 6 Engine 5946 non-null float64 7 Power 5946 non-null float64 8 Owner_Count 5946 non-null category 9 Location_Bangalore 5946 non-null uint8 10 Location_Chennai 5946 non-null uint8 11 Location_Coimbatore 5946 non-null uint8 12 Location_Delhi 5946 non-null uint8 13 Location_Hyderabad 5946 non-null uint8 14 Location_Jaipur 5946 non-null uint8 15 Location_Kochi 5946 non-null uint8 16 Location_Kolkata 5946 non-null uint8 17 Location_Mumbai 5946 non-null uint8 18 Location_Pune 5946 non-null uint8 19 Fuel_Type_Diesel 5946 non-null uint8 20 Fuel_Type_Electric 5946 non-null uint8 21 Fuel_Type_LPG 5946 non-null uint8 22 Fuel_Type_Petrol 5946 non-null uint8 23 Transmission_Manual 5946 non-null uint8 dtypes: category(1), float64(6), int64(2), uint8(15) memory usage: 671.2 KB
# As we had seen earlier, log makes sense for the new price and price.
cdata1h['New_Price_log']=np.log(cdata1h['New_Price'])
cdata1h['Price_log']=np.log(cdata1h['Price'])
# Just check again all columns look as they should
cdata1h.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Year | 5946.0 | NaN | NaN | NaN | 2013.406828 | 3.221012 | 1998.000000 | 2012.000000 | 2014.000000 | 2016.000000 | 2019.000000 |
| Kilometers_Driven | 5946.0 | NaN | NaN | NaN | 57154.150521 | 34826.615482 | 171.000000 | 33900.250000 | 53000.000000 | 72671.000000 | 480000.000000 |
| Seats | 5946.0 | NaN | NaN | NaN | 5.279852 | 0.803517 | 2.000000 | 5.000000 | 5.000000 | 5.000000 | 10.000000 |
| New_Price | 5946.0 | NaN | NaN | NaN | 21.381640 | 24.098560 | 3.910000 | 7.880000 | 11.300000 | 21.770000 | 375.000000 |
| Price | 5946.0 | NaN | NaN | NaN | 9.449041 | 11.118352 | 0.440000 | 3.500000 | 5.650000 | 9.915000 | 160.000000 |
| Mileage | 5946.0 | NaN | NaN | NaN | 18.343102 | 4.174979 | 6.400000 | 15.300000 | 18.200000 | 21.100000 | 33.540000 |
| Engine | 5946.0 | NaN | NaN | NaN | 1618.941473 | 597.213566 | 72.000000 | 1198.000000 | 1493.000000 | 1968.000000 | 5998.000000 |
| Power | 5946.0 | NaN | NaN | NaN | 112.765286 | 53.381456 | 34.200000 | 76.925000 | 94.000000 | 138.030000 | 560.000000 |
| Owner_Count | 5946.0 | 4.0 | 1.0 | 4881.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Location_Bangalore | 5946.0 | NaN | NaN | NaN | 0.058695 | 0.235073 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| Location_Chennai | 5946.0 | NaN | NaN | NaN | 0.081231 | 0.273213 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| Location_Coimbatore | 5946.0 | NaN | NaN | NaN | 0.105785 | 0.307589 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| Location_Delhi | 5946.0 | NaN | NaN | NaN | 0.092836 | 0.290226 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| Location_Hyderabad | 5946.0 | NaN | NaN | NaN | 0.122772 | 0.328202 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| Location_Jaipur | 5946.0 | NaN | NaN | NaN | 0.068618 | 0.252824 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| Location_Kochi | 5946.0 | NaN | NaN | NaN | 0.108981 | 0.311641 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| Location_Kolkata | 5946.0 | NaN | NaN | NaN | 0.089304 | 0.285206 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| Location_Mumbai | 5946.0 | NaN | NaN | NaN | 0.132022 | 0.338543 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| Location_Pune | 5946.0 | NaN | NaN | NaN | 0.102758 | 0.303668 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| Fuel_Type_Diesel | 5946.0 | NaN | NaN | NaN | 0.533972 | 0.498886 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 |
| Fuel_Type_Electric | 5946.0 | NaN | NaN | NaN | 0.000336 | 0.018339 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| Fuel_Type_LPG | 5946.0 | NaN | NaN | NaN | 0.001682 | 0.040979 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| Fuel_Type_Petrol | 5946.0 | NaN | NaN | NaN | 0.454591 | 0.497976 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| Transmission_Manual | 5946.0 | NaN | NaN | NaN | 0.715439 | 0.451243 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 |
| New_Price_log | 5946.0 | NaN | NaN | NaN | 2.683481 | 0.790638 | 1.363537 | 2.064328 | 2.424803 | 3.080533 | 5.926926 |
| Price_log | 5946.0 | NaN | NaN | NaN | 1.828039 | 0.866223 | -0.820981 | 1.252763 | 1.731656 | 2.294048 | 5.075174 |
# Let's make a separate copy for the linear regression using log values
cdata1hl=cdata1h.copy()
# Drop the normal column from the one with log. And drop the log column from the one with normal values.
cdata1hl = cdata1hl.drop(['New_Price', 'Price'], axis = 1)
cdata1h = cdata1h.drop(['New_Price_log', 'Price_log'], axis = 1)
cdata1hl.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 5946 entries, 0 to 6018 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year 5946 non-null int64 1 Kilometers_Driven 5946 non-null int64 2 Seats 5946 non-null float64 3 Mileage 5946 non-null float64 4 Engine 5946 non-null float64 5 Power 5946 non-null float64 6 Owner_Count 5946 non-null category 7 Location_Bangalore 5946 non-null uint8 8 Location_Chennai 5946 non-null uint8 9 Location_Coimbatore 5946 non-null uint8 10 Location_Delhi 5946 non-null uint8 11 Location_Hyderabad 5946 non-null uint8 12 Location_Jaipur 5946 non-null uint8 13 Location_Kochi 5946 non-null uint8 14 Location_Kolkata 5946 non-null uint8 15 Location_Mumbai 5946 non-null uint8 16 Location_Pune 5946 non-null uint8 17 Fuel_Type_Diesel 5946 non-null uint8 18 Fuel_Type_Electric 5946 non-null uint8 19 Fuel_Type_LPG 5946 non-null uint8 20 Fuel_Type_Petrol 5946 non-null uint8 21 Transmission_Manual 5946 non-null uint8 22 New_Price_log 5946 non-null float64 23 Price_log 5946 non-null float64 dtypes: category(1), float64(6), int64(2), uint8(15) memory usage: 671.2 KB
cdata1h.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 5946 entries, 0 to 6018 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year 5946 non-null int64 1 Kilometers_Driven 5946 non-null int64 2 Seats 5946 non-null float64 3 New_Price 5946 non-null float64 4 Price 5946 non-null float64 5 Mileage 5946 non-null float64 6 Engine 5946 non-null float64 7 Power 5946 non-null float64 8 Owner_Count 5946 non-null category 9 Location_Bangalore 5946 non-null uint8 10 Location_Chennai 5946 non-null uint8 11 Location_Coimbatore 5946 non-null uint8 12 Location_Delhi 5946 non-null uint8 13 Location_Hyderabad 5946 non-null uint8 14 Location_Jaipur 5946 non-null uint8 15 Location_Kochi 5946 non-null uint8 16 Location_Kolkata 5946 non-null uint8 17 Location_Mumbai 5946 non-null uint8 18 Location_Pune 5946 non-null uint8 19 Fuel_Type_Diesel 5946 non-null uint8 20 Fuel_Type_Electric 5946 non-null uint8 21 Fuel_Type_LPG 5946 non-null uint8 22 Fuel_Type_Petrol 5946 non-null uint8 23 Transmission_Manual 5946 non-null uint8 dtypes: category(1), float64(6), int64(2), uint8(15) memory usage: 671.2 KB
# Let's start with the dataset with normal values of new price and price (not the one with log)
# Drop the dependent variable from X
X = cdata1h.drop(['Price'], axis=1)
# the dependent variable
y = cdata1h[['Price']]
# Let's define a few utility functions that will be used laser to print the regression model results
# This code is taken from the SLR_MLS2_AnimeRatingPrediction notebook provided by the faculty
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
r2 = r2_score(targets, predictions)
n = predictors.shape[0]
k = predictors.shape[1]
return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
# function to compute MAPE
def mape_score(targets, predictions):
return np.mean(np.abs(targets - predictions) / (targets)) * 100
# Adding this function for mape that handles the division by zero issue when the actual value is zero.
def ss_mape_score(y_true, y_pred):
y_true, y_pred = np.array(y_true), np.array(y_pred)
return np.mean(np.abs((y_true - y_pred) / np.maximum(np.ones(len(y_true)), np.abs(y_true))))*100
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
"""
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
r2 = r2_score(target, pred) # to compute R-squared
adjr2 = adj_r2_score(predictors, target, pred) # to compute adjusted R-squared
rmse = np.sqrt(mean_squared_error(target, pred)) # to compute RMSE
mae = mean_absolute_error(target, pred) # to compute MAE
mape = ss_mape_score(target, pred) # to compute MAPE
#mape = mean_absolute_percentage_error(target, pred)
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"RMSE": rmse,
"MAE": mae,
"R-squared": r2,
"Adj. R-squared": adjr2,
"MAPE": mape,
},
index=[0],
)
return df_perf
# Split data set so 70% is used as training data and the other 30% is used for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
# Build the model on training data
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
LinearRegression()
coef_df = pd.DataFrame(
np.append(regression_model.coef_, regression_model.intercept_),
index=X_train.columns.tolist() + ["Intercept"],
columns=["Coefficients"],
)
coef_df
| Coefficients | |
|---|---|
| Year | 0.920602 |
| Kilometers_Driven | -0.000029 |
| Seats | -0.616001 |
| New_Price | 0.177537 |
| Mileage | -0.113526 |
| Engine | 0.000719 |
| Power | 0.077535 |
| Owner_Count | -0.100741 |
| Location_Bangalore | 1.804338 |
| Location_Chennai | 1.243862 |
| Location_Coimbatore | 1.817538 |
| Location_Delhi | -0.207644 |
| Location_Hyderabad | 2.031476 |
| Location_Jaipur | 1.126324 |
| Location_Kochi | -0.405138 |
| Location_Kolkata | -1.154135 |
| Location_Mumbai | -0.752341 |
| Location_Pune | 0.518672 |
| Fuel_Type_Diesel | -0.711004 |
| Fuel_Type_Electric | 6.229380 |
| Fuel_Type_LPG | 0.554208 |
| Fuel_Type_Petrol | -2.374362 |
| Transmission_Manual | -0.586233 |
| Intercept | -1849.356530 |
# Score on training data
regression_model.score(X_train, y_train)
0.7575852999741766
# Score on test data
regression_model.score(X_test, y_test)
0.7802927703836928
# Checking model performance on train set
print("Training Performance\n")
lin_reg_model_train_perf = model_performance_regression(regression_model, X_train, y_train)
lin_reg_model_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 5.577382 | 3.075094 | 0.757585 | 0.756238 | 58.645313 |
# Checking model performance on train set
print("Training Performance\n")
lin_reg_model_test_perf = model_performance_regression(regression_model, X_test, y_test)
lin_reg_model_test_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 4.972583 | 2.980673 | 0.780293 | 0.777422 | 58.767603 |
# Now let's build the model with log values of New Price and Price
X = cdata1hl.drop(['Price_log'], axis=1)
# the dependent variable
y = cdata1hl[['Price_log']]
# Split data set so 70% is used as training data and the other 30% is used for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
# Build the regression model
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
LinearRegression()
coef_df = pd.DataFrame(
np.append(regression_model.coef_, regression_model.intercept_),
index=X_train.columns.tolist() + ["Intercept"],
columns=["Coefficients"],
)
coef_df
| Coefficients | |
|---|---|
| Year | 0.120084 |
| Kilometers_Driven | -0.000001 |
| Seats | 0.024784 |
| Mileage | -0.003317 |
| Engine | 0.000094 |
| Power | 0.004222 |
| Owner_Count | -0.069679 |
| Location_Bangalore | 0.153747 |
| Location_Chennai | 0.034337 |
| Location_Coimbatore | 0.064197 |
| Location_Delhi | -0.064198 |
| Location_Hyderabad | 0.114417 |
| Location_Jaipur | -0.050477 |
| Location_Kochi | -0.057986 |
| Location_Kolkata | -0.237820 |
| Location_Mumbai | -0.060963 |
| Location_Pune | -0.048409 |
| Fuel_Type_Diesel | 0.059963 |
| Fuel_Type_Electric | 0.719276 |
| Fuel_Type_LPG | 0.055180 |
| Fuel_Type_Petrol | -0.107340 |
| Transmission_Manual | -0.143067 |
| New_Price_log | 0.441893 |
| Intercept | -241.536892 |
# Score on training data
regression_model.score(X_train, y_train)
0.9196053760742274
# Score on test data
regression_model.score(X_test, y_test)
0.9158555531771687
# Checking model performance on train set
print("Training Performance\n")
lin_reg_model_train_perf = model_performance_regression(regression_model, X_train, y_train)
lin_reg_model_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.244885 | 0.177645 | 0.919605 | 0.919159 | 11.46895 |
# Checking model performance on train set
print("Training Performance\n")
lin_reg_model_test_perf = model_performance_regression(regression_model, X_test, y_test)
lin_reg_model_test_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.252879 | 0.183052 | 0.915856 | 0.914756 | 12.067424 |
Based on raw values of Price and New_Price
Based on log values of Price and New_Price
We will be using metric functions defined in sklearn for RMSE, MAE, and $R^2$.
Functions to calculate adjusted $R^2$ defined.
Function that will print out all the above metrics in one go defined.
Due to log function returning 0 for Prices close to 1 (lakh), MAPE returns NaN. I have used a modified version of mape function (source: Stackoverflow)
Observations (the dataset with log values)
The train and test $R^2$ are 0.92 and 0.92, indicating that the model explains 92% of the total variation in the train and test sets respectively. Also, both scores are comparable.
RMSE values on the train and test sets are also comparable.
This shows that the model is not overfitting.
MAE indicates that our current model is able to predict (log) price within a mean error of 0.18 on the test set.
MAPE of 12 on the test data means that we are able to predict within ~12% of the log(price).
The overall performance is much better with log values than the model built earlier on raw New_Price and Price values.